Extracting Names From Arabic Text for Question-Answering Systems
Abstract
Tagging and extracting proper names is key to improving the effectiveness of question-answering systems. The valuable information in a text is usually located around proper names, so the names must be found before that information can be collected. By extracting proper names from the text, we provide question-answering systems with each proper name found, some information about it, and where it was found. Proper names in Arabic do not start with a capital letter as they do in many other languages, so special treatment is needed to find them in a text. Little research has been conducted in this area; most efforts have been based on a number of heuristic rules used to find names in the text. In this paper we present a new technique that extracts names from text by building a database and graphs to represent the words that might form a name and the relationships between them. First, we mark the phrases that might include names; second, we build graphs to represent the words in these phrases and the relationships between them; third, we apply rules to find the names.
Introduction
Names play a very important role in many areas of natural language engineering, especially in question-answering systems, text summarization, text classification, information retrieval systems, and information extraction (Cowie and Lehnert, 1996). Rau (1991) argues that names not only account for a large percentage of the unknown words in a text but are also recognized as a crucial source of information for extracting contents, identifying a topic, or detecting relevant documents in information retrieval systems.
Many researchers have attacked this problem in a variety of languages, but only a few limited research projects have focused on natural language processing problems for Arabic. Mehdi (1986) describes a computer system for syntactic parsing of Arabic sentences. The system is implemented using a Definite Clause Grammar (DCG) formalism in Prolog.
Ibrahim, Douglas, and Faahmy (1989) suggested a framework to deal with the morphology of the Arabic language. Foxley and Feddag (1991) adopted a strategy of combining affixes to reduce the operational overhead of affix manipulation routines. Feddag and Foxley (1990) provided a single powerful framework for an intelligent database in which the system stores only the roots of the verbs and uses a program intelligent enough to handle all derived forms automatically.
Wacholder et al. (1997) analyzed the types of ambiguity, structural and semantic, that make the discovery of proper names in text difficult. Kim and Evens (1995) built a natural language processing system for extracting personal names and other proper nouns from the Wall Street Journal. Yangarber et al. (2002) presented an algorithm called NOMEN for learning generalized names in text. NOMEN uses a novel form of bootstrapping to grow sets of textual instances and of their contextual patterns. Abuleil and Evens (2002) built a parser that uses a set of rules to parse Arabic text, tag the proper nouns, and extract information about them.
As defined in the Message Understanding Conference (Chinchor, 1998), name recognition consists of identifying and categorizing entity names (person, organization, location), temporal expressions (dates and times), and some types of numerical expressions (percentages, monetary values, and so on), which are considered to constitute up to 10% of written texts (Coates-Stephens, 1992). According to McDonald (1996), two kinds of data should be taken into account in order to identify and classify possible names: internal evidence and external evidence. The former is provided by the expression itself and the latter by the context in which it occurs.
Among the different techniques used to process these data, some systems are based on statistical methods such as Hidden Markov Models (Bikel et al., 1999), some on strictly linguistic methods that make use of grammar rules (Magnini et al., 2002), and others combine rules and statistics (Mikheev et al., 1998).
Collecting and adding information to the database is important for improving the effectiveness of question-answering systems. The valuable information in a text is usually located around proper names, so to collect this information we must first find the names. In this paper we build a technique that extracts the proper names from the text and provides question-answering systems with each name found, some information about it, and its locations in the text.
Proper Names in Arabic
The problem of identifying proper names is particularly difficult for Arabic: names do not start with capital letters, so we cannot mark them in the text by looking at the first letter of a word. To tag proper names in Arabic text, we use keywords to guide us to the places where names can be found. Using keywords, we mark name phrases that might contain a name, and then we process these phrases to extract the names.
One way to process these phrases is to construct a set of heuristic rules and use them to parse each phrase and extract the name. This technique has significant limitations. It is hard to tell exactly where the name starts and ends in the phrase, especially for foreign names. No matter how many rules are added to the system, they will never cover every case, since each person writes in a different style and the same name phrase can be written in many different ways. In this paper we describe a new technique for processing name phrases to extract names from the text.
This technique is based on the relationships between the words in the name phrases: we build a directed graph that represents the words as nodes and the relationships between them as weights on the edges. The relationship (weight) between two words is the number of times the two words appear attached to each other in the name phrases. In this paper we focus on names that appear several times in the text, not on names that appear only once or twice in thousands of documents. The rest of the paper answers two major questions: where names can be found in the text and how to extract them.
Where to Look for Names in the Text
We generated a set of rules to predict where names are located in the text. These rules are based on two things: keywords and some special verbs. In Arabic text, names tend to appear close to one of these keywords or special verbs, so we look for them in the text to mark the name phrases. We assume that a name is no more than three words away from the keyword or special verb, and that the longest name is 7 words. We therefore mark 10 words (3 + 7) to the left of the keyword or special verb and 10 words to its right to identify the name phrases.
We collected tens of keywords and special verbs in a previous research project (Abuleil and Evens, 2002) and classified them into different classes: people, locations, organizations, events, and products. Table 1 shows some examples of these keywords and special verbs. Some keywords consist of two words. For example, the word “Foreign” is usually connected to the word “Minister” to form the keyword “Foreign Minister”. Other examples are نائب الرئيس “Vice President” and المتحدث بأسم “Spokesperson”.
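The marking rule described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `mark_name_phrases`, the dictionary layout, and the pre-tokenized input are all hypothetical, and only single-word keywords are handled (two-word keywords such as “Foreign Minister” would need an extra matching step). The lexicon holds a few entries taken from Table 1. The window is expressed as words before and after the keyword in token order; how that maps to visual left and right depends on the script direction.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon: a few Table 1 entries mapped to (main type, sub type).
KEYWORDS: Dict[str, Tuple[str, str]] = {
    "مدير": ("Person", "Manager"),
    "رئيس": ("Person", "President"),
    "مدينة": ("Location", "City"),
    "بنك": ("Organization", "Bank"),
    "صرح": ("Person", "N/A"),
}

# 3 words of distance from the keyword + 7 words for the longest name.
WINDOW = 10


def mark_name_phrases(
    tokens: List[str],
    keywords: Dict[str, Tuple[str, str]] = KEYWORDS,
    window: int = WINDOW,
) -> List[dict]:
    """Return one candidate name phrase per keyword or special verb found."""
    phrases = []
    for i, token in enumerate(tokens):
        if token not in keywords:
            continue
        main_type, sub_type = keywords[token]
        phrases.append(
            {
                "keyword": token,
                "main_type": main_type,
                "sub_type": sub_type,
                # Up to `window` tokens on each side, clipped at text boundaries.
                "before": tokens[max(0, i - window):i],
                "after": tokens[i + 1:i + 1 + window],
            }
        )
    return phrases
```

Calling `mark_name_phrases(tokens)` on a tokenized document yields a list of candidate phrases, each tagged with the class of the keyword that triggered it, which later steps can search for names.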
KEYWORD / SPECIAL VERB | MAIN TYPE | SUB TYPE
مدير Manager | Person | Manager
رئيس President | Person | President
مدرس Professor | Person | Professor
دولة Country | Location | Country
مدينة City | Location | City
صحيفة Newspaper | Organization | Newspaper
بنك Bank | Organization | Bank
مؤتمر Conference | Event | Conference
معرض Exhibit | Event | Exhibit
حرب War | Event | War
تحدث Said | Person | N/A
صرح Announced | Person | N/A
Table 1.
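Once name phrases have been marked, the weighted directed graph described earlier could be built roughly as follows. This is a sketch under stated assumptions rather than the paper's actual data structure: the names `build_word_graph` and `successors` are hypothetical, "attached to each other" is read here as "immediately adjacent, in reading order", and the English tokens in the usage note are toy placeholders, not data from the paper. The rules that the authors apply to the graph to extract names are not given in this excerpt and are not reproduced here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]


def build_word_graph(phrases: Iterable[List[str]]) -> Dict[Edge, int]:
    """Count how often each ordered pair of adjacent words occurs.

    Nodes are words; the weight of edge (a, b) is the number of times
    word b directly follows word a across all name phrases.
    """
    weights: Dict[Edge, int] = defaultdict(int)
    for phrase in phrases:
        for first, second in zip(phrase, phrase[1:]):
            weights[(first, second)] += 1
    return dict(weights)


def successors(graph: Dict[Edge, int], word: str) -> List[Tuple[str, int]]:
    """List the words that follow `word`, heaviest edge first."""
    found = [(b, w) for (a, b), w in graph.items() if a == word]
    return sorted(found, key=lambda pair: (-pair[1], pair[0]))
```

For example, given the toy phrases `[["president", "bill", "clinton", "said"], ["met", "bill", "clinton", "yesterday"]]`, the edge `("bill", "clinton")` receives weight 2, while every other edge has weight 1. Word pairs that recur across many phrases stand out by weight, which matches the paper's stated focus on names that appear several times in the text.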
Publication date: 2004